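Because textual attributes such as font are core design elements of document format and page style, automatic attribute recognition benefits a wide range of practical applications. Existing approaches already perform satisfactorily when differentiating disparate attributes, but they still struggle to distinguish similar attributes that differ only subtly. Moreover, their performance drops severely in real-world scenarios where unexpected and obvious imaging distortions appear. In this paper, we aim to tackle these problems by proposing TaCo, a contrastive framework for textual attribute recognition tailored to the most common document scenes. Specifically, TaCo leverages contrastive learning to dispel the ambiguity trap arising from vague and open-ended attributes. To realize this goal, we design the learning paradigm from three perspectives: 1) generating attribute views, 2) extracting subtle but crucial details, and 3) exploiting valuable view pairs for learning, so as to fully unlock the pre-training potential. Extensive experiments show that TaCo surpasses its supervised counterparts and advances the state of the art on multiple attribute recognition tasks. Online services of TaCo will be made available.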
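Document information extraction (DIE) has attracted increasing attention because of its many advanced real-world applications. Although recent work has achieved competitive results, these approaches usually fail on complex documents with noisy OCR results or mutative layouts. To address these problems, this paper proposes the Generative Multi-modal Network (GMN) for real-world scenarios, a robust multi-modal generation method that requires no predefined label categories. With a carefully designed spatial encoder and a modal-aware mask module, GMN can handle complex documents that are hard to serialize into a sequential order. Moreover, GMN tolerates errors in OCR results and requires no character-level annotation, which is vital because fine-grained annotation of numerous documents is laborious and may even require annotators with specialized domain knowledge. Extensive experiments show that GMN achieves new state-of-the-art performance on several public DIE datasets and surpasses other methods, especially in realistic scenes.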
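Scene segmentation and classification (SSC) is a critical step in video structuring analysis. Intuitively, learning the two tasks jointly lets them promote each other by sharing common information. However, scene segmentation is concerned mostly with local differences between adjacent shots, whereas classification needs a global representation of scene segments, which can leave the model dominated by one of the two tasks during training. In this paper, we overcome these challenges from an alternative perspective by uniting the two tasks into one through a new form of shot-link prediction: a link connects two adjacent shots and indicates that they belong to the same scene or category. To this end, we propose a general One Stage Multimodal Sequential Link Framework (OS-MSL) that both distinguishes and leverages the two-fold semantics by reformulating the two learning tasks as a unified one. Furthermore, we tailor a specific module, called DiffCorrNet, to explicitly extract difference and correlation information among shots. Extensive experiments are conducted on a brand-new large-scale dataset collected from real-world applications and on MovieScenes. Both sets of results demonstrate the effectiveness of the proposed method against strong baselines.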
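The link formulation described above can be illustrated with a minimal sketch. It assumes the simplest possible encoding, which the abstract does not specify: one binary link per pair of adjacent shots, with 1 meaning "same scene" and 0 meaning "scene boundary". The function names `scenes_to_links` and `links_to_scenes` are hypothetical, and category information is left out entirely.

```python
from typing import List


def scenes_to_links(scene_ids: List[int]) -> List[int]:
    """Return one binary link per adjacent shot pair.

    Assumed encoding (not taken from the paper): 1 = both shots are in the
    same scene, 0 = a scene boundary lies between them.
    """
    return [int(a == b) for a, b in zip(scene_ids, scene_ids[1:])]


def links_to_scenes(links: List[int]) -> List[int]:
    """Recover consecutive scene indices from links (assumes at least one shot)."""
    scenes = [0]
    for link in links:
        # A 0 link starts a new scene; a 1 link continues the current one.
        scenes.append(scenes[-1] if link else scenes[-1] + 1)
    return scenes


# Six shots grouped into three scenes.
shot_scene_ids = [0, 0, 0, 1, 1, 2]
links = scenes_to_links(shot_scene_ids)
recovered = links_to_scenes(links)
```

Under this encoding, a sequence of n shots yields n - 1 links, and segmentation is recovered by cutting wherever a link is 0.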
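Recently, table structure recognition has made impressive progress with the help of deep graph models. Most of these models either exploit a single visual cue of the tabular elements or simply combine visual cues with other modalities via early fusion to reason about their graph relationships. However, neither early fusion nor reasoning over each modality individually is appropriate for the great diversity of table structures. Instead, different modalities are expected to collaborate with each other in different patterns for different table cases. The importance of intra-inter modality interactions for table structure reasoning remains unexplored in the community. In this paper, we define this as the heterogeneous table structure recognition (Hetero-TSR) problem. To fill this gap, we present novel Neural Collaborative Graph Machines (NCGM) equipped with stacked collaborative blocks, which alternately extract intra-modality context and model inter-modality interactions in a hierarchical way. NCGM represents the intra-inter modality relationships of tabular elements more robustly, which significantly improves recognition performance. We also show that the proposed NCGM can modulate the collaborative pattern of different modalities conditioned on the context of intra-modality cues, which is vital for diverse table cases. Experimental results on benchmarks demonstrate that the proposed NCGM achieves state-of-the-art performance and beats other contemporary methods by a large margin, especially in challenging scenarios.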
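Recently, Vision Transformers (ViT), with self-attention (SA) as the de facto ingredient, have shown great potential in the computer vision community. To trade off efficiency against performance, one group of works performs SA only within local patches and abandons global contextual information, which is indispensable for visual recognition tasks. To solve this issue, subsequent global-local ViTs combine local SA with global SA in a parallel or alternating manner within the model. Nevertheless, exhaustively combined local and global context may be redundant for various visual data, and the receptive field within each layer is fixed. A more graceful alternative is to let global and local context contribute adaptively to accommodate different visual data. To achieve this goal, we propose a novel ViT architecture, termed NomMer, which can dynamically nominate the synergistic global-local context in a vision transformer. By investigating the working pattern of the proposed NomMer, we further explore which context information it focuses on. Benefiting from this "dynamic nomination" mechanism, and without bells and whistles, NomMer not only achieves 84.5% Top-1 classification accuracy on ImageNet with only 73M parameters, but also shows promising performance on dense prediction tasks, i.e., object detection and semantic segmentation. The code and models will be made publicly available at https://github.com/nommer1125/nommer.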
This paper focuses on designing efficient models with few parameters and low FLOPs for dense prediction. Even though CNN-based lightweight methods have achieved impressive results after years of research, the trade-off between model accuracy and constrained resources still needs further improvement. This work rethinks the essential unity of the efficient Inverted Residual Block in MobileNetV2 and the effective Transformer in ViT, inductively abstracting a general concept, the Meta-Mobile Block. We argue that the specific instantiation matters greatly for model performance, even when the framework is shared. Motivated by this observation, we deduce a simple yet efficient modern Inverted Residual Mobile Block (iRMB) for mobile applications, which absorbs CNN-like efficiency to model short-distance dependencies and Transformer-like dynamic modeling capability to learn long-distance interactions. Furthermore, we design a ResNet-like 4-phase Efficient MOdel (EMO) built only from a series of iRMBs for dense applications. Extensive experiments on the ImageNet-1K, COCO2017, and ADE20K benchmarks demonstrate the superiority of EMO over state-of-the-art methods; e.g., EMO-1M/2M/5M achieve 71.5, 75.1, and 78.4 Top-1, surpassing SoTA CNN- and Transformer-based models while trading off accuracy and efficiency well.
We aim to bridge the gap between common-sense, few-sample human learning and large-data machine learning. We derive a theory of human-like few-shot learning from the von Neumann-Landauer principle. Modelling human learning is difficult because how people learn varies from one person to another. Under commonly accepted definitions, we prove that all human or animal few-shot learning, and major models of such learning, including the Free Energy Principle and Bayesian Program Learning, approximate our theory under the Church-Turing thesis. We find that deep generative models such as the variational autoencoder (VAE) can be used to approximate our theory and perform significantly better than baseline models, including deep neural networks, on image recognition, low-resource language processing, and character recognition.
Despite significant progress in object categorization in recent years, a number of important challenges remain, mainly the ability to learn from limited labeled data and to recognize object classes within a large, potentially open, set of labels. Zero-shot learning is one way of addressing these challenges, but it has only been shown to work with class vocabularies of limited size, and it typically requires a separation between supervised and unsupervised classes, allowing the former to inform the latter but not vice versa. We propose the notion of vocabulary-informed learning to alleviate these challenges and to address supervised, zero-shot, generalized zero-shot, and open set recognition within a unified framework. Specifically, we propose a weighted maximum margin framework for semantic manifold-based recognition that incorporates distance constraints from both supervised and unsupervised vocabulary atoms. The distance constraints ensure that labeled samples are projected closer to their correct prototypes in the embedding space than to any others. We show that the resulting model improves supervised, zero-shot, generalized zero-shot, and large open set recognition, with a vocabulary of up to 310K classes, on the Animals with Attributes and ImageNet datasets.
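The distance constraint described above can be sketched as a hinge penalty. This is a minimal, unweighted illustration under stated assumptions, not the paper's weighted maximum margin formulation: Euclidean distance, a fixed margin of 1.0, and the toy prototype vectors are all hypothetical choices, as is the function name `margin_violation`.

```python
import math
from typing import Dict, List


def euclidean(u: List[float], v: List[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def margin_violation(
    embedding: List[float],
    label: str,
    prototypes: Dict[str, List[float]],
    margin: float = 1.0,
) -> float:
    """Sum of hinge penalties over all other vocabulary prototypes.

    The penalty for a prototype is zero once the sample is closer to its
    correct prototype than to that one by at least `margin`.
    """
    d_correct = euclidean(embedding, prototypes[label])
    return sum(
        max(0.0, margin + d_correct - euclidean(embedding, proto))
        for name, proto in prototypes.items()
        if name != label
    )


# Toy vocabulary: "zebra" stands in for a class with no labeled samples.
prototypes = {"cat": [0.0, 0.0], "dog": [3.0, 4.0], "zebra": [0.0, 0.5]}
loss = margin_violation([0.0, 0.0], "cat", prototypes)
```

In this toy case the "dog" prototype is far enough away to satisfy the constraint, while the nearby "zebra" prototype violates it, so prototypes of unlabeled classes still shape the embedding.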
We consider infinite horizon Markov decision processes (MDPs) with fast-slow structure, meaning that certain parts of the state space move "fast" (and in a sense, are more influential) while other parts transition more "slowly." Such structure is common in real-world problems where sequential decisions need to be made at high frequencies, yet information that varies at a slower timescale also influences the optimal policy. Examples include: (1) service allocation for a multi-class queue with (slowly varying) stochastic costs, (2) a restless multi-armed bandit with an environmental state, and (3) energy demand response, where both day-ahead and real-time prices play a role in the firm's revenue. Models that fully capture these problems often result in MDPs with large state spaces and large effective time horizons (due to frequent decisions), rendering them computationally intractable. We propose an approximate dynamic programming algorithmic framework based on the idea of "freezing" the slow states, solving a set of simpler finite-horizon MDPs (the lower-level MDPs), and applying value iteration (VI) to an auxiliary MDP that transitions on a slower timescale (the upper-level MDP). We also extend the technique to a function approximation setting, where a feature-based linear architecture is used. On the theoretical side, we analyze the regret incurred by each variant of our frozen-state approach. Finally, we give empirical evidence that the frozen-state approach generates effective policies using just a fraction of the computational cost, while illustrating that simply omitting slow states from the decision modeling is often not a viable heuristic.
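For readers unfamiliar with value iteration, the following is a minimal sketch of the standard discounted version on a tiny tabular MDP. It is not the frozen-state algorithm itself: the freezing of slow states and the lower-level finite-horizon MDPs are not reproduced, and the two-state MDP, the discount factor, and the `transitions` layout are hypothetical choices made only for illustration.

```python
from typing import Dict, List, Tuple

# transitions[state][action] = list of (probability, next_state, reward)
Transitions = Dict[int, Dict[int, List[Tuple[float, int, float]]]]


def value_iteration(
    transitions: Transitions,
    gamma: float = 0.9,
    tol: float = 1e-8,
    max_iters: int = 10_000,
) -> Dict[int, float]:
    """Apply Bellman optimality backups until the largest change is below tol."""
    values = {s: 0.0 for s in transitions}
    for _ in range(max_iters):
        new_values = {
            s: max(
                sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            )
            for s, actions in transitions.items()
        }
        delta = max(abs(new_values[s] - values[s]) for s in transitions)
        values = new_values
        if delta < tol:
            break
    return values


# State 0: action 0 stays (reward 0), action 1 moves to state 1 (reward 1).
# State 1: its only action stays put and earns reward 2.
toy_mdp: Transitions = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 1, 2.0)]},
}
values = value_iteration(toy_mdp, gamma=0.5)
```

With a discount of 0.5, state 1 is worth 2 / (1 - 0.5) = 4 and state 0 is worth 1 + 0.5 * 4 = 3, which the iteration approaches. Each sweep costs time proportional to the number of state-action-outcome triples, which is why large state spaces and long effective horizons make the full problem expensive.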
We present Muse, a text-to-image Transformer model that achieves state-of-the-art image generation performance while being significantly more efficient than diffusion or autoregressive models. Muse is trained on a masked modeling task in discrete token space: given the text embedding extracted from a pre-trained large language model (LLM), Muse is trained to predict randomly masked image tokens. Compared to pixel-space diffusion models, such as Imagen and DALL-E 2, Muse is significantly more efficient due to the use of discrete tokens and requiring fewer sampling iterations; compared to autoregressive models, such as Parti, Muse is more efficient due to the use of parallel decoding. The use of a pre-trained LLM enables fine-grained language understanding, translating to high-fidelity image generation and the understanding of visual concepts such as objects, their spatial relationships, pose, cardinality etc. Our 900M parameter model achieves a new SOTA on CC3M, with an FID score of 6.06. The Muse 3B parameter model achieves an FID of 7.88 on zero-shot COCO evaluation, along with a CLIP score of 0.32. Muse also directly enables a number of image editing applications without the need to fine-tune or invert the model: inpainting, outpainting, and mask-free editing. More results are available at https://muse-model.github.io
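The masked modeling task on discrete tokens can be sketched as follows. This is a minimal illustration of random masking only, under stated assumptions: the `MASK_ID` sentinel, the fixed mask ratio, and the `mask_tokens` helper are hypothetical, and Muse's actual tokenizer, masking schedule, and text conditioning are not described in the abstract and are not reproduced here.

```python
import random
from typing import List, Tuple

MASK_ID = -1  # hypothetical sentinel, assumed not to collide with real token ids


def mask_tokens(
    tokens: List[int], mask_ratio: float, rng: random.Random
) -> Tuple[List[int], List[int]]:
    """Replace a random subset of a non-empty token list with MASK_ID.

    Returns the corrupted sequence and the sorted masked positions; a model
    would be trained to predict the original tokens at those positions.
    """
    n_mask = max(1, round(mask_ratio * len(tokens)))
    masked_positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    for pos in masked_positions:
        corrupted[pos] = MASK_ID
    return corrupted, masked_positions


image_tokens = [5, 9, 2, 7, 7, 1, 3, 8]
corrupted, positions = mask_tokens(image_tokens, mask_ratio=0.5, rng=random.Random(0))
targets = [image_tokens[p] for p in positions]
```

Because every masked position can be predicted in the same forward pass, several tokens can be filled in at once, which is the source of the efficiency gain over token-by-token autoregressive decoding that the abstract mentions.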